(12) INTERNATIONAL APPLICATION PUBLISHED UNDER THE PATENT COOPERATION TREATY (PCT)

(19) World Intellectual Property 
Organization 
International Bureau 



(43) International Publication Date 
14 October 2004 (14.10.2004) 




(10) International Publication Number
WO 2004/087965 A2



(51) International Patent Classification: C12Q 1/68



(21) International Application Number: PCT/US2004/009059

(22) International Filing Date: 24 March 2004 (24.03.2004) 

(25) Filing Language: English 

(26) Publication Language: English 



(30) Priority Data:
10/401,830    28 March 2003 (28.03.2003)    US



(71) Applicant (for all designated States except US): CORGENTECH, INC. [US/US];
650 Gateway Boulevard, South San Francisco, CA 94080 (US).

(72) Inventors; and 

(75) Inventors/Applicants (for US only): ZHANG, Jie 
[CN/US]; 4930 Poplar Terrace, Campbell, CA 95008 
(US). YIN, Hsiu-Ying [US/US]; 1039 Redmond Avenue,
San Jose, CA 95120 (US). MCEVOY, Leslie, Margaret
[US/US]; 2416 Tamalpais Street, Mountain View, CA
94043 (US).

(74) Agent: DREGER, Ginger, R.; Heller Ehrman White &
McAuliffe LLP, 275 Middlefield Road, Menlo Park, CA
94025-3506 (US).

(81) Designated States (unless otherwise indicated, for every
kind of national protection available): AE, AG, AL, AM,
AT, AU, AZ, BA, BB, BG, BR, BW, BY, BZ, CA, CH, CN,
CO, CR, CU, CZ, DE, DK, DM, DZ, EC, EE, EG, ES, FI,
GB, GD, GE, GH, GM, HR, HU, ID, IL, IN, IS, JP, KE,
KG, KP, KR, KZ, LC, LK, LR, LS, LT, LU, LV, MA, MD,
MG, MK, MN, MW, MX, MZ, NA, NI, NO, NZ, OM, PG,
PH, PL, PT, RO, RU, SC, SD, SE, SG, SK, SL, SY, TJ, TM,
TN, TR, TT, TZ, UA, UG, US, UZ, VC, VN, YU, ZA, ZM,
ZW.

(84) Designated States (unless otherwise indicated, for every
kind of regional protection available): ARIPO (BW, GH,

[Continued on next page] 



(54) Title: STATISTICAL ANALYSIS OF REGULATORY FACTOR BINDING SITES OF DIFFERENTIALLY EXPRESSED
GENES

[Figure: scatter plot titled "FREQUENCIES OF TF BINDING SITES BETWEEN G1 AND S PHASE DIFFERENTIAL
EXPRESSED GENES AND WHOLE GENOME BACKGROUND". Horizontal axis: "FREQUENCIES OF TF BINDING FOR G1/S
PHASE DIFFERENTIAL EXPRESSED GENES", ticks 0.1 to 0.7. Vertical axis: ticks 0.1 to 0.7 (label not
recoverable). Labeled points: E2F-1, E2F, E2F-1/DP-1.]

(57) Abstract: The invention concerns the statistical analysis of regulatory factor binding sites
of differentially expressed genes. More particularly, the invention concerns methods for identifying
and characterizing regulatory factor, e.g. transcription factor, binding sites in differentially
expressed genes in order to develop therapeutic strategies for the treatment of diseases which are
accompanied by differential gene expression or to study biological processes.






GM, KE, LS, MW, MZ, SD, SL, SZ, TZ, UG, ZM, ZW),
Eurasian (AM, AZ, BY, KG, KZ, MD, RU, TJ, TM), European
(AT, BE, BG, CH, CY, CZ, DE, DK, EE, ES, FI, FR,
GB, GR, HU, IE, IT, LU, MC, NL, PL, PT, RO, SE, SI, SK,
TR), OAPI (BF, BJ, CF, CG, CI, CM, GA, GN, GQ, GW,
ML, MR, NE, SN, TD, TG).



Published: 

— without international search report and to be republished 
upon receipt of that report 

For two-letter codes and other abbreviations, refer to the "Guidance Notes on Codes and
Abbreviations" appearing at the beginning of each regular issue of the PCT Gazette.






STATISTICAL ANALYSIS OF REGULATORY FACTOR BINDING SITES OF
DIFFERENTIALLY EXPRESSED GENES

Background of the Invention 

Field of the Invention 

The present invention concerns the statistical analysis of regulatory factor binding sites of
differentially expressed genes. More particularly, the invention concerns methods for identifying and
characterizing regulatory factor, e.g. transcription factor, binding sites in differentially expressed
genes in order to develop therapeutic strategies for the treatment of diseases which are accompanied by
differential gene expression.

Description of the Related Art 

One of the main approaches to identifying novel therapeutic targets is the study of differential
gene expression, typically comparing normal and diseased biological samples, or biological samples
representative of different stages of a particular disease or pathologic condition. In general, methods
used to study differential gene expression can be based on hybridization analysis and/or sequencing of
polynucleotides. The most commonly used methods known in the art for the quantification of
differential gene expression in a sample include northern blotting and in situ hybridization (Parker &
Barnes, Methods in Molecular Biology 106:247-283 (1999)); polymerase chain reaction (PCR) (Weis
et al., Trends in Genetics 8:263-264 (1992)), such as quantitative real-time PCR; and microarray
analysis. Alternatively, antibodies may be employed that can recognize specific duplexes, including
DNA duplexes, RNA duplexes, and DNA-RNA hybrid duplexes or DNA-protein duplexes.
Representative methods for sequencing-based gene expression analysis include Serial Analysis of
Gene Expression (SAGE), and gene expression analysis by massively parallel signature sequencing
(MPSS).

Differential gene expression studies have been conducted on a variety of human tissues and
biological samples representing a variety of biological processes, such as various cancers, neuronal
diseases, developmental disorders, aging processes, infectious diseases, and the like.

Summary of the Invention 
The present invention is based on the recognition that the large number of differentially
expressed genes identified in a biological sample, which may be, but need not be, representative of
various diseases, disease states and other abnormalities, is the result of changes in the transcription
functioning of a handful of regulatory factors, such as transcription factors (TF).

In one aspect, the present invention concerns a method for statistical analysis of differentially
expressed genes, comprising:

(a) obtaining a set of differentially expressed genes;

(b) screening genomic sequences including the regulatory regions of the
differentially expressed genes for the presence of regulatory factor binding sites; and

(c) identifying at least one regulatory factor binding site enriched within the set
of differentially expressed genes relative to a genome-wide or tissue-wide background.

The set of differentially expressed genes can be obtained from results of differential gene or
protein expression studies, and thus can, for example, be generated by microarray, RT-PCR, or
proteomics approaches.

In step (c), enrichment may, for example, be determined by comparing the frequencies or
probabilities of the occurrence of the regulatory binding site or binding sites identified in step (c)
within the gene set with those in the genome-wide or tissue-wide background.
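
By way of non-limiting illustration, steps (a) to (c) can be sketched in Python as follows. The sketch assumes that the regulatory regions are already available as plain sequence strings and that a binding site is represented by an exact-match string. The function names `site_frequency` and `enrichment_ratios`, the toy sequences, and the E2F-like motif string are hypothetical and are not taken from this specification; a practical screen would more likely score sites with a position weight matrix.

```python
from typing import Dict, List


def site_frequency(sequences: List[str], motif: str) -> float:
    """Return the fraction of sequences that contain the motif at least once."""
    if not sequences:
        return 0.0
    motif = motif.upper()
    hits = sum(1 for seq in sequences if motif in seq.upper())
    return hits / len(sequences)


def enrichment_ratios(
    gene_set: List[str], background: List[str], motifs: Dict[str, str]
) -> Dict[str, float]:
    """Compare per-motif frequencies in the gene set against the background."""
    ratios = {}
    for name, motif in motifs.items():
        f_set = site_frequency(gene_set, motif)
        f_bg = site_frequency(background, motif)
        if f_bg > 0:
            ratios[name] = f_set / f_bg
        else:
            # Site never seen in the background: the ratio is undefined, so flag it.
            ratios[name] = float("inf") if f_set > 0 else 0.0
    return ratios


# Toy data (hypothetical): regulatory regions as plain strings.
differential = ["AATTTCGCGCAA", "GGTTTCGCGCTT", "ACGTACGTACGT"]
genome_wide = ["AATTTCGCGCAA", "ACGTACGTACGT", "TTTTTTTTTTTT", "GGGGCCCCGGGG"]
print(enrichment_ratios(differential, genome_wide, {"E2F-like": "TTTCGCGC"}))
```

A ratio above 1 only suggests enrichment; whether it is statistically meaningful depends on the set sizes and calls for a significance test.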

In a particular embodiment, the set of differentially expressed genes may be part of a gene
expression profile characteristic of a disease, disorder, or biological process. All diseases, disorders
and biological processes associated with gene transcription are included, such as, without limitation,
tumor, oncological diseases, neurological diseases, cardiovascular diseases, renal diseases, infectious
diseases, digestive diseases, metabolic diseases, inflammatory diseases, autoimmune diseases,
dermatological diseases, and diseases associated with trauma or abnormal skeletal development.
Metabolic diseases specifically include, without limitation, diabetes, and diseases of lipid,
carbohydrate and calcium metabolism. Dermatological diseases specifically include, without
limitation, diseases requiring wound healing.

In a further specific embodiment, the disease is cancer, which can, for example, be breast
cancer, renal cancer, leukemia, colon cancer, lung cancer, prostate cancer, hepatocellular cancer, 
gastric cancer, pancreatic cancer, cervical cancer, ovarian cancer, liver cancer, bladder cancer, cancer 
of the urinary tract, thyroid cancer, renal cancer, carcinoma, melanoma, and brain cancer. 

In another embodiment, the disorder is a developmental disorder. 

In yet another embodiment, the biological process represented by the differentially expressed
gene set is associated with aging.

In a further embodiment, the gene set consists of genes that show at least about two-fold, or
at least about four-fold, or at least about ten-fold differential expression relative to control.

In a still further embodiment, the regulatory factor binding site is identified within a 5'
upstream core promoter region, a 5' upstream enhancer region, an intron region, and/or a 3' regulatory
region.

In another embodiment, the regulatory factor binding site is a transcription factor binding site.
Without limitation, and merely by way of illustration, the transcription factor can be selected from the
group consisting of c-Fos, c-Jun, AP-1, Elk, ATF, c-Ets-1, c-Rel, CBF, CTF, GATA-1, POU1F1,
NF-κB, POU2F1, POU2F2, p53, Pax-3, Sp1, TCP, TAR, TFEB, TCF-1, IFHF, E2F-1, E2F-2, E2F-3,
E2F-4, HIF-1, HIF-1α, HOXA1, HOXA5, Sp3, Sp4, TCF-4, APC, and STAT5A.

In a specific embodiment, the transcription factor is E2F-1, E2F-2, E2F-3, NF-κB, Elk, AP-1,
c-Fos, or c-Jun.

Typically, a large number of differentially expressed genes is analyzed. Thus, the analysis 
may extend to at least about 100 differentially expressed genes, or at least about 500 differentially 
expressed genes. 

In a further aspect, the invention concerns a method for designing a treatment strategy based
upon the identification of the enriched regulatory factor binding site(s) by the foregoing method.

In a specific embodiment, the enriched regulatory factor binding site is a transcription factor 
binding site binding to at least one transcription factor. 

In a further embodiment, a consensus binding site is identified based on the enriched
transcription factor binding site.

The treatment strategy may, for example, rely on the design of a double-stranded
oligonucleotide decoy, which competes with said enriched binding site for binding to the
corresponding transcription factor, or on an anti-sense oligonucleotide designed to bind to the mRNA
of the enriched transcription factor.

In a different aspect, the invention concerns a method of designing a consensus regulatory
factor binding site, comprising identifying a regulatory factor binding site enriched within a set of
differentially expressed genes, relative to a genome-wide or tissue-wide control, and designing a
consensus regulatory factor binding site consisting essentially of nucleotides shared by the regulatory
factor binding sites enriched within the set of differentially expressed genes.
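
One simple reading of "nucleotides shared by" the enriched sites can be sketched as follows. This is an assumption-laden illustration, not the method of the specification: it presumes the enriched sites have already been aligned to equal length, keeps a nucleotide only where every site agrees, and writes the wildcard `N` elsewhere. The function name `shared_consensus` and the example sites are hypothetical.

```python
from typing import List


def shared_consensus(sites: List[str], wildcard: str = "N") -> str:
    """Build a consensus from pre-aligned, equal-length binding sites.

    A position keeps its nucleotide only if all sites share it;
    otherwise the wildcard character is emitted.
    """
    if not sites:
        raise ValueError("at least one binding site is required")
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("sites must be pre-aligned and of equal length")

    consensus = []
    for column in zip(*(site.upper() for site in sites)):
        consensus.append(column[0] if len(set(column)) == 1 else wildcard)
    return "".join(consensus)


# Hypothetical aligned sites; positions 4 and 5 differ between them.
print(shared_consensus(["TTTCGCGC", "TTTGGCGC", "TTTCCCGC"]))  # TTTNNCGC
```

A less strict variant could keep the majority nucleotide at each position or emit IUPAC ambiguity codes instead of `N`.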

In yet another aspect, the invention concerns a method of analyzing the enrichment of a
regulatory factor binding site in a biological sample comprising a set of differentially expressed genes,
comprising comparing the frequency or probability of the occurrence of the regulatory binding site
within the gene set with the frequency or probability of its occurrence in a reference sample. The
statistical analysis is preferably performed by using a hypergeometric distribution model.
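
A minimal sketch of such a hypergeometric test is given below. It assumes that the differentially expressed gene set is drawn without replacement from the reference (background) population; the parameter names `N`, `K`, `n`, `k` and the function name `hypergeom_enrichment_pvalue` are illustrative choices, not terms defined in this specification. The function returns the upper-tail probability of seeing at least `k` site-containing genes in the set by chance.

```python
from math import comb


def hypergeom_enrichment_pvalue(N: int, K: int, n: int, k: int) -> float:
    """Upper-tail hypergeometric probability P(X >= k).

    N: number of genes in the reference (background) population
    K: number of reference genes containing the binding site
    n: number of genes in the differentially expressed set
    k: number of genes in the set containing the binding site
    """
    if not (0 <= K <= N and 0 <= n <= N and 0 <= k <= min(K, n)):
        raise ValueError("inconsistent counts")
    total = comb(N, n)
    # Sum the probabilities of observing k, k+1, ..., min(K, n) hits.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1))
    return tail / total


# Hypothetical counts: 10 background genes, 4 with the site;
# 3 genes in the set, 2 of them with the site.
print(hypergeom_enrichment_pvalue(N=10, K=4, n=3, k=2))
```

A small value indicates that the observed number of site-containing genes would be unlikely if the gene set were a random draw from the background. When many binding sites are tested, a multiple-testing correction would normally be applied as well.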

Brief Description of Drawings

Figure 1 shows the frequencies of TF binding sites between G1 and S phase differentially
expressed genes and whole genome background.

Figure 2 is a graphical representation of the number of microarray-related publications
between 1995 and 2002.

Detailed Description of the Preferred Embodiment 
A. Definitions 

Unless defined otherwise, technical and scientific terms used herein have the same meaning as
commonly understood by one of ordinary skill in the art to which this invention belongs. Singleton et
al., Dictionary of Microbiology and Molecular Biology 2nd ed., J. Wiley & Sons (New York, NY
1994), and March, Advanced Organic Chemistry Reactions, Mechanisms and Structure 4th ed., John
Wiley & Sons (New York, NY 1992), provide one skilled in the art with a general guide to many of
the terms used in the present application.

For purposes of the present invention, the following terms are defined below.

The term "regulatory factor" is used in the broadest sense, and includes any factor that is
capable of affecting the mRNA transcription process of genes. Specifically included within this term
are transcription factors.

The terms "gene regulatory sequence," "cis-regulatory element," "cis-acting regulatory
element," "cis-regulatory sequence," and "cis-acting regulatory sequence" are used interchangeably,
and refer to any regulatory sequence that controls gene expression, including, without limitation, 5'
regulatory regions and 3' regulatory regions, such as promoters, enhancers, silencers, transcription
termination signals, and splicing signals; intron regions and intergenic regions; and sequences that
regulate translation. Specifically included are DNA recognition sequences with which transcription
factors associate (also referred to as transcription factor binding sites).

The term "transcription factor binding site" refers to short consensus genomic sequences that
are located immediately before the transcription start sites (TSS) of genes. A transcription regulatory
region can contain several binding sites, and can therefore be bound by several transcription factors.

"Trans-factors" are proteins that bind to cis-regulatory sequences.

"Transcription factors" are proteins that bind to DNA near the transcription initiation site of a
gene, and either assist or inhibit RNA polymerase in initiation and maintenance of transcription.

"DNA binding domain" is a region within a transcription factor that recognizes specific bases
in a target gene near the transcription initiation site.

The "transcription starting site (TSS)" is the position where a gene's mRNA starts to be
transcribed from DNA by RNA polymerase II.

The term "transcription factor decoy" or "decoy" is used herein to refer to short
double-stranded oligonucleotides that specifically bind target transcription factors, thereby preventing
the transcription factors from initiating the transcription of their target genes.

The term "microarray" refers to an ordered arrangement of hybridizable array elements,
preferably polynucleotide probes, on a substrate.

The term "polynucleotide," when used in singular or plural, generally refers to any
polyribonucleotide or polydeoxyribonucleotide, which may be unmodified RNA or DNA or modified
RNA or DNA. Thus, for instance, polynucleotides as defined herein include, without limitation,
single- and double-stranded DNA, DNA including single- and double-stranded regions, single- and
double-stranded RNA, and RNA including single- and double-stranded regions, hybrid molecules
comprising DNA and RNA that may be single-stranded or, more typically, double-stranded or include
single- and double-stranded regions. In addition, the term "polynucleotide" as used herein refers to
triple-stranded regions comprising RNA or DNA or both RNA and DNA. The strands in such regions
may be from the same molecule or from different molecules. The regions may include all of one or
more of the molecules, but more typically involve only a region of some of the molecules. One of the
molecules of a triple-helical region often is an oligonucleotide. The term "polynucleotide"
specifically includes cDNAs. The term includes DNAs (including cDNAs) and RNAs that contain
one or more modified bases. Thus, DNAs or RNAs with backbones modified for stability or for other
reasons are "polynucleotides" as that term is intended herein. Moreover, DNAs or RNAs comprising
unusual bases, such as inosine, or modified bases, such as tritiated bases, are included within the term
"polynucleotides" as defined herein. In general, the term "polynucleotide" embraces all chemically,
enzymatically and/or metabolically modified forms of unmodified polynucleotides, as well as the
chemical forms of DNA and RNA characteristic of viruses and cells, including simple and complex
cells.

The term "oligonucleotide" refers to a relatively short polynucleotide, including, without
limitation, single-stranded deoxyribonucleotides, single- or double-stranded ribonucleotides, 
RNA:DNA hybrids and double-stranded DNAs. Oligonucleotides, such as single-stranded DNA 
probe oligonucleotides, are often synthesized by chemical methods, for example using automated 
oligonucleotide synthesizers that are commercially available. However, oligonucleotides can be 
made by a variety of other methods, including in vitro recombinant DNA-mediated techniques and 
by expression of DNAs in cells and organisms. 

The terms "differentially expressed gene," "differential gene expression" and their synonyms,
which are used interchangeably, refer to a gene whose expression is activated to a higher or lower
level in a sample obtained from a subject suffering from a disease, relative to its expression in a
normal or control (reference) sample. The terms also include genes whose expression is activated to a
higher or lower level at different stages of the same disease. A differentially expressed gene may be
either activated or inhibited at the nucleic acid level or protein level, or may be subject to alternative
splicing to result in a different polypeptide product. Such differences may, for example, be evidenced
by a change in mRNA levels, surface expression, secretion or other partitioning of a polypeptide.
Differential gene expression may include a comparison of expression between two or more genes or
their gene products, or a comparison of the ratios of the expression between two or more genes or their
gene products, or even a comparison of two differently processed products of the same gene, which
differ between normal subjects and subjects suffering from a disease, or between various stages of the
same disease. Differential expression includes both quantitative, as well as qualitative, differences in
the temporal or cellular expression pattern in a gene or its expression products among, for example,
normal and diseased cells, or among cells which have undergone different disease events or disease
stages. For the purpose of this invention, "differential gene expression" is considered to be
"significant" when there is at least an about two-fold, preferably at least about four-fold, more
preferably at least about six-fold, most preferably at least about ten-fold difference between the
expression of a given gene in normal and diseased subjects, or in various stages of disease
development in a diseased subject.
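
The fold-difference criterion above can be illustrated with a short sketch. It assumes that each gene has one positive expression value for the normal sample and one for the diseased sample, and it treats up- and down-regulation symmetrically; the names `fold_difference` and `select_differential` and the example values are hypothetical.

```python
from typing import Dict, List, Tuple


def fold_difference(normal: float, diseased: float) -> float:
    """Return the fold difference (always >= 1) between two expression levels."""
    if normal <= 0 or diseased <= 0:
        raise ValueError("expression values must be positive")
    ratio = diseased / normal
    return ratio if ratio >= 1 else 1 / ratio


def select_differential(
    expression: Dict[str, Tuple[float, float]], min_fold: float = 2.0
) -> List[str]:
    """List genes whose (normal, diseased) levels differ by at least min_fold."""
    return [
        gene
        for gene, (normal, diseased) in expression.items()
        if fold_difference(normal, diseased) >= min_fold
    ]


# Hypothetical (normal, diseased) expression values.
levels = {"geneA": (10.0, 45.0), "geneB": (100.0, 120.0), "geneC": (80.0, 10.0)}
print(select_differential(levels, min_fold=2.0))  # ['geneA', 'geneC']
print(select_differential(levels, min_fold=6.0))  # ['geneC']
```

Raising `min_fold` from two to four, six, or ten corresponds to the progressively stricter thresholds named in the definition.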

A "set" of differentially expressed genes includes a sufficient number of genes for statistical
analysis. In general, the set will include at least about 20, or at least about 50, or at least about 100,
or at least about 200, or at least about 500, or at least about 1000 genes.

The term "treatment" refers to both therapeutic treatment and prophylactic or preventative
measures, wherein the object is to prevent or slow down (lessen) the targeted pathologic condition or
disorder. Those in need of treatment include those already with the disorder as well as those prone to
have the disorder or those in whom the disorder is to be prevented. In tumor (e.g., cancer) treatment, a
therapeutic agent may directly decrease the pathology of tumor cells, or render the tumor cells more
susceptible to treatment by other therapeutic agents, e.g., radiation and/or chemotherapy.

The term "tumor," as used herein, refers to all neoplastic cell growth and proliferation,
whether malignant or benign, and all precancerous and cancerous cells and tissues. 

The terms "cancer" and "cancerous" refer to or describe the physiological condition in
mammals that is typically characterized by unregulated cell growth. Examples of cancer include, but
are not limited to, breast cancer, colon cancer, lung cancer, prostate cancer, hepatocellular cancer,
gastric cancer, pancreatic cancer, cervical cancer, ovarian cancer, liver cancer, bladder cancer, cancer
of the urinary tract, thyroid cancer, renal cancer, carcinoma, melanoma, head and neck cancer, and
brain cancer.

The "pathology" of cancer includes all phenomena that compromise the well-being of the
patient. This includes, without limitation, abnormal or uncontrollable cell growth, metastasis,
interference with the normal functioning of neighboring cells, release of cytokines or other secretory
products at abnormal levels, suppression or aggravation of inflammatory or immunological response,
neoplasia, premalignancy, malignancy, invasion of surrounding or distant tissues or organs, such as
lymph nodes, etc.

B. Detailed Description 

The practice of the present invention will employ, unless otherwise indicated, conventional
techniques of molecular biology (including recombinant techniques), microbiology, cell biology, and
biochemistry, which are within the skill of the art. Such techniques are explained fully in the
literature, such as, "Molecular Cloning: A Laboratory Manual", 2nd edition (Sambrook et al., 1989);
"Oligonucleotide Synthesis" (M.J. Gait, ed., 1984); "Animal Cell Culture" (R.I. Freshney, ed., 1987);
"Methods in Enzymology" (Academic Press, Inc.); "Handbook of Experimental Immunology", 4th
edition (D.M. Weir & C.C. Blackwell, eds., Blackwell Science Inc., 1987); "Gene Transfer Vectors
for Mammalian Cells" (J.M. Miller & M.P. Calos, eds., 1987); "Current Protocols in Molecular
Biology" (F.M. Ausubel et al., eds., 1987); and "PCR: The Polymerase Chain Reaction", (Mullis et
al., eds., 1994).

The present invention is based on the systematic comparison of the regulatory regions of
genes identified as being differentially expressed in a particular disease, disease state, or abnormality.
In particular, the present invention is based on the recognition that a common link among the
numerous differentially expressed genes is a change in the transcription processes of a handful of
regulatory, e.g. transcription, factors.

As noted before, researchers have a variety of techniques at their disposal to study differential
gene expression. Although the most frequently used approaches are microarray and RT-PCR, other
techniques, such as Northern blotting, RNase protection assays, differential plaque hybridization,
subtractive hybridization, serial analysis of gene expression (SAGE; Velculescu et al., Science
270:484-487 (1995); and Velculescu et al., Cell 88:243-51 (1997)), rapid analysis of gene expression
(RAGE; Wang et al., Nucleic Acids Research, 27:4609-18, (1999)), and massively parallel signature
sequencing (MPSS; Brenner et al., Nature Biotechnology 18:630-634 (2000)), are equally suitable for
the study of differential gene expression. The number of studies of differential gene expression
continues to grow. Figure 2 outlines the number of publications based on microarray technology,
both for biomedical research as a whole and for cancer-specific research.

In the microarray method, polynucleotide sequences of interest (including cDNAs and
oligonucleotides) are plated, or arrayed, on a microchip substrate. The arrayed sequences are then
hybridized with specific DNA probes from cells or tissues of interest. In a specific embodiment of the
microarray technique, PCR amplified inserts of cDNA clones are applied to a substrate in a dense
array, typically including at least about 10,000 nucleotide sequences. The immobilized microarrayed
genes are suitable for hybridization under stringent conditions. Fluorescently labeled cDNA probes
applied to the chip hybridize with specificity to each spot of DNA on the array. After stringent
washing to remove non-specifically bound probes, the chip is scanned by confocal laser microscopy or
by another detection method, such as a CCD camera. Quantitation of hybridization of each arrayed
element allows for assessment of corresponding mRNA abundance. With dual color fluorescence,
separately labeled cDNA probes generated from two sources of RNA are hybridized pairwise to the
array. The relative abundance of the transcripts from the two sources corresponding to each specified
gene is thus determined simultaneously, thereby providing differential gene expression data.
Microarray analysis can be performed by commercially available equipment, following manufacturer's
protocols, such as by using the Affymetrix GeneChip technology, or Agilent's microarray technology.
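
As a rough illustration of how dual color data yield relative abundance, the following sketch computes a per-spot log2 ratio from the two channel intensities. It is a simplified assumption, not a vendor protocol: background subtraction and between-channel normalization are omitted, and the function name `log2_ratios`, the `floor` parameter, and the intensity values are hypothetical.

```python
from math import log2
from typing import Dict


def log2_ratios(
    sample: Dict[str, float], reference: Dict[str, float], floor: float = 1.0
) -> Dict[str, float]:
    """Per-spot log2(sample / reference) for spots present in both channels.

    Intensities below `floor` are raised to `floor` so the ratio stays defined.
    """
    return {
        spot: log2(max(sample[spot], floor) / max(reference[spot], floor))
        for spot in sample
        if spot in reference
    }


# Hypothetical fluorescence intensities for two labeled probe populations.
diseased_channel = {"gene1": 400.0, "gene2": 50.0, "gene3": 120.0}
normal_channel = {"gene1": 100.0, "gene2": 200.0, "gene3": 120.0}
print(log2_ratios(diseased_channel, normal_channel))
```

On this scale a value of +1 or -1 corresponds to a two-fold difference in either direction, and 0 means equal abundance in the two sources.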

RT-PCR can also be used to compare mRNA levels in different sample populations, such as
in normal and diseased (e.g. tumor) tissues, to characterize patterns of gene expression, to discriminate
between closely related mRNAs, and to analyze RNA structure.

The first step is the isolation of mRNA from a target sample. As RNA cannot serve as a
template for PCR, gene expression profiling by RT-PCR then proceeds with the reverse transcription
of the RNA template into cDNA, followed by its exponential amplification in a PCR reaction. The two
most commonly used reverse transcriptases are avian myeloblastosis virus reverse transcriptase
(AMV-RT) and Moloney murine leukemia virus reverse transcriptase (MMLV-RT). The reverse
transcription step is typically primed using specific primers, random hexamers, or oligo-dT primers,
depending on the circumstances and the goal of expression profiling. For example, extracted RNA
can be reverse-transcribed using a GeneAmp RNA PCR kit (Perkin Elmer, CA, USA), following the
manufacturer's instructions. The derived cDNA can then be used as a template in the subsequent PCR
reaction.

A more recent variation of the RT-PCR technique is real time quantitative PCR, which
measures PCR product accumulation through a dual-labeled fluorogenic probe (i.e., TaqMan® probe).
Real time PCR is compatible both with quantitative competitive PCR, where an internal competitor for
each target sequence is used for normalization, and with quantitative comparative PCR using a
normalization gene contained within the sample, or a housekeeping gene for RT-PCR. For further
details see, e.g. Heid et al., Genome Research 6:986-994 (1996).

Differential gene expression can also be studied at the protein level, using proteomics
techniques. The proteome is the totality of the proteins present in a sample (e.g. tissue, organism, or
cell culture) at a certain point of time. Proteomics includes, among other things, study of the global
changes of protein expression in a sample (also referred to as "expression proteomics"). Proteomics
typically includes the following steps: (1) separation of individual proteins in a sample by 2-D gel
electrophoresis (2-D PAGE); (2) identification of the individual proteins recovered from the gel, e.g.
by mass spectrometry and/or N-terminal sequencing; and (3) analysis of the data using bioinformatics.
Proteomics methods are valuable supplements to other methods of gene expression profiling, and can
be used, alone or in combination with other methods, to study differential gene expression. For
further details see, e.g. Proteomics in Practice: A Laboratory Manual of Proteome Analysis, R.
Westermeier et al., eds., John Wiley & Sons, 2002.

Typically, gene expression studies identify hundreds to a few thousand differentially
expressed genes in the test samples, relative to normal samples. For example, studies in normal
biological processes, such as HeLa cell cycles, and abnormal biological phenotype, such as rotavirus
infected tissue, have shown that at least about 500 genes exhibit significant changes relative to their
normal counterparts. Most of the gene expression data have been deposited into public and
commercial databases, such as the Stanford Microarray Database (SMD), the Yale Microarray Database,
and ArrayExpress at the European Bioinformatics Institute (EBI). These, and other publicly available
gene expression databases are listed in Table 1 below.



Table 1 



Name of database 


Description 


ArrayExpress 


A repository for microarray based gene 
expression aaia mamiaineQ oy curopean 
Bioinformatics Institute. 


ChipDB 


A searchable database of gene expression. 


ExpressDB 


A relational database containing yeast and E. coli 
RNA expression data. 


Gene Expression Atlas 


A database for gene expression profile from 91 

normal human and mouse samples across a 
diverse array of tissues, organs, and cell lines. 


Gene Expression Database (GDX) 


A database of Mouse Genome Informatics 
at the Jackson laboratory. 


Gene Expression Omnibus 


A database in NCBI for supporting the 
public use and disseminating of gene 
expression data. 


GeneX 


National Center for Genome Resource's 
imitative to provide an Ihtemet-available 
repository of gene expression data. 


Human Gene Expression Index O^uGE Index) 


Aims to provide a comprehensive database 
to understand the expression of human 
genes in normal human tissues. 


M-CHiPS (Multi-Conditional Hybridization 
Bitensity Processing System) 


A data warehousing concept and focuses on 
providing a structure suitable for statistical 
analysis of a microarray database's entire 
components including the experiment 
annotations. 


READ (RIKEN cDNA Expression Array 
Database) 


A database maintained by RIKEN (The institute 
of Physical and Chemical Research), Japan. 


RNA Abundance Database (RAD) 


RNA Abundance Database (ElAD) is a public 
gene expression database designed to hold data 
from array-based and non-array-based (SAGE) 
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experiments. The ultimate goal is to allow 
comparative analysis of experiments performed 
by different laboratories using different platforms 
and investigating diff^ent biological systems. 


Saccharomyces Genome Database 
(SGD):Expression Connection 


A gene expression database of Saccharomyces 
genome at Stanford University; provides 
simultaneous search of the results of several 
microarray studies for gene e^qnession data for a 
given gene or GRP. 


Stanford Micioarray Database (SMD) 


Stores raw and normalized data from microarray 

experiments, as well as their corresponding 
image files. In addition, SMD provides 
inter&ces for data retrieval, analysis and 
visualization. Data is released to the public at the 
researcher's discretion or upon publication. 


Yale Microarray Database 




Yeast Microarray Global Viewer 


A database for yeast gene expression data 
maintained by Laboratoire de genetique 
moleculaire, Ecole Normale Superieure. 


3D-Gene Expression Database 


Preliminary structure for a database of 3D- 
visualization of developmental gene expression. 


BODYMAP 


A databank of gene expression information of 
himian and mouse genes, created by random 
sequencing of clones in 3'-directed cDNA • 
libraries. 


Gene Resource Locator 


The goal is to map millions of ESTs to the human 
genome for the study of the exon-intron 
structures of genes, the alternative splicing of 
pre-mRNAs, the promoter regions of ftiU-length- 
enriched cDNA sequences, and the gene- 
expression patterns associated with ESTs. 


RNA Abundance Database (RAD) 


A public gene expression database designed to 
hold data from array-based and non-array-based 
(SAGE) experiments. The ultimate goal is to 
allow comparative analysis of experiments 
performed by different laboratories using 
different platforms and investigating different 
biological systems. 


TissueInfo 


An online database which determines the tissue 
expression profile of a sequence by comparing 
the given sequence against the EST database. 
Each EST comes from a library derived from a 
specific tissue type. 



Despite extensive research in this field and the large volume of accumulated data, in view of 
the complexity of gene expression, differential gene expression data are difficult to interpret. 

It has been well accepted that it is very unlikely that each of the numerous differentially 
expressed genes has mutations or some other defects. On the contrary, it is possible that the large 
number of differentially expressed genes is the result of changes in a few key phenomena or 
mechanisms, which can affect simultaneously the expression levels of many genes. The present 
invention is based on the recognition that the large number of differentially expressed genes in various 
diseases, disease states or other abnormalities results from changes in a few regulatory factors, such as 
transcription factors (TFs). 

Transcription factors (TFs) are a class of proteins that control and initialize the process of 
transcribing genetic information coded by DNA into mRNA. All currently known TFs are classified 
into five different subfamilies, named after their functional domains, namely the Basic Domains, Zinc- 
coordinating DNA binding domain, Helix-turn-helix domains, beta-Scaffold Factors with Minor 
Groove Contacts, and Other Transcription Factors. Usually, at least a few transcription factors are 
required to form a transcriptional complex that binds to the regulatory regions of genes and, as a 
result, controls and initializes the mRNA transcription machinery. These binding processes are 
mediated by the DNA binding domains of TF proteins. It is known that only some of the transcription 
factors are capable of binding directly to DNA, while others are required to form the functional 
transcription machinery, without the requirement of direct binding to the regulatory regions of the 
target genes. 

At the present time, there are more than 4000 known TFs, about 2000 of which are from 
mammalian species. Exemplary TFs, without limitation, include c-Fos, c-Jun, AP-1, ATF, c-Ets-1, c-Rel, 
CRF, CTF, GATA-1, POU1F1, NF-kB, POU2F1, POU2F2, p53, Pax-3, Sp1, TCP, TAR, TFEB, 
TCF-1, TFIIF, E2F-1, E2F-2, E2F-3, E2F-4, HIF-1, HIF-1a, HOXA1, HOXA5, Sp3, Sp4, TCF-4, 
APC, and STAT5A. 

Of the mammalian TFs, only several hundred have been shown to have the ability to bind 
directly to the regulatory regions (cis-regulatory binding sites) of the target genes, and only a few 
hundred TF binding sites have been characterized to date. The TF binding sites of genes are short 
stretches of DNA sequences located in the regulatory region of the genes. These sites are specific for 
different DNA binding TFs, and usually are about 6 to about 16 bases in length. It is known that 
within a given binding site there are bases at certain positions that are absolutely required for binding 
by the corresponding TF, while others can tolerate some base-change variations. For further details 
see, for example, Davidson, E.H., Genomic Regulatory Systems: Development and Evolution, ISBN 
0-12-205351-6, Academic Press, 2001, and Michael Carey, Stephen T. Smale, 
Transcriptional Regulation in Eukaryotes, ISBN 0-87969-537-4, Cold Spring Harbor Laboratory 
Press, 2000. 

There are several transcription factor related databases, which are listed in Table 2 below. 



Table 2 

Database    TF            Sites        Address
TRANSFAC    factors       sites        http://transfac.gbf.de/TRANSFAC/index.html
TRRD        factors       sites        http://wwwmgs.bionet.nsc.ru/mgs/gnw/trrd
TFD         factors       sites        http://kisec.cmb.ki.se/kisac/databases/tfd.html
COMPEL      compository   sites        http://compel.bionet.nsc.ru/
EPD         N/A           promoters    http://www.epd.isb-sib.ch/
IMD         factors       sites        http://bimas.dcrt.nih.gov/molbio/matrixs/



Of the listed databases TRANSFAC collects the most in terms of number of TF binding sites, 
and is updated and cited frequently (Heinemeyer et al., 1998, Heinemeyer et al., 1999, Karas et al., 
1997, Knuppel et al., 1994, Matys et al., 2003, Wingender et al., 1996, Wingender et al., 1997, 
Wingender et al., 1997, Wingender et al., 2000, Wingender et al., 2001). The usage of TF binding 
sites for protein-pathway evaluation has been recently reported (Krull et al., 2003). 

In the broadest sense, the present invention provides, for the first time, a method for the 
comparative analysis of regulatory regions of a large number of genes in order to identify common 
regulatory mechanisms and/or consensus regulatory factor binding sites shared by such genes. 
Accordingly, the present invention provides new insight into so far undiscovered relationships 
between such genes, and enables the identification of significant regulatory factors from the large 
amount of gene expression data available at the present time or to be generated in the future. 

The idea underlying the present invention is that one can identify certain consensus 
regulatory factor binding sites, such as, for example, TF binding sites, shared by most of the 
differentially expressed genes identified in various diseases, disease states or abnormalities. If 
certain regulatory factor (e.g. TF) binding sites are found enriched among such differentially expressed 
genes relative to their tissue-wide or genome-wide occurrence, the identified binding sites very likely 
play a major role in the resultant differential expression and, in turn, could be responsible for the 
disease or abnormality, such as the final cell-fate change seen in cancer or tumor. 

In one particular aspect, the present invention provides a novel approach for comparative 
analysis of regulatory regions of differentially expressed genes in order to identify consensus 
regulatory regions enriched within such genes, which can then be used to identify one or more 
regulatory factors that play a role in the regulation of their expression. 

In another aspect, the present invention provides a method for identifying regulatory factors, 
such as transcription factors (TFs), providing a link among the large number of genes differentially 
expressed in a disease, disease state or abnormality, by a systematic comparison of their regulatory 
regions. 

As a result of their involvement in an essential regulatory mechanism associated with a 
disease process, the shared regulatory factor binding sites and the corresponding regulatory factors are 
valuable therapeutic-development targets. For example, by altering the TFs identified, for example, 
by an antisense oligonucleotide approach (to bind the mRNA of the TF and in turn to alter the 
corresponding protein expression) or by changing the transcription effects of such TFs, e.g. by using 
the transcription decoy method (to competitively bind to corresponding TFs), new approaches can be 
developed for the treatment (including prevention) of a variety of diseases, disorders, and 
abnormalities, or for interfering with certain detrimental or undesired biological processes, such as 
aging. In a more generic sense, the present invention provides a valuable tool for biomedical studies 
and research efforts in general, and provides a unique tool for understanding such processes. In 
general, the information provided by the present invention can be utilized for a variety of different 
purposes and applications including, but not limited to, biomedical research, pre-clinical development, 
drug screening applications, target discovery and target validation, building genome- or tissue-wide 
connections between regulatory profiles of different genes, understanding the genome or tissue 
background of various known regulatory factors, understanding the genome or tissue background of 
various known transcription factors, and the like. 

Accordingly, the present invention is directed to a method for the statistical analysis of the 
regulatory factor (e.g. TF) binding sites of differentially expressed genes. In a particular aspect, the 
present invention provides new therapeutic targets by identifying regulatory factors, e.g. transcription factors, 
that are responsible for the differential expression of a large number of genes found in a 
biological sample representative of a disease, disorder, or a particular biological process. 

In a particular embodiment, the method of the present invention comprises the following 
steps: (1) the generation of a list of genes with significant differential expression; (2) the identification 
of cis-regulatory regions within the differentially expressed genes; (3) the mapping of transcription 
factor binding sites on the cis-regulatory regions identified; and (4) the statistical analysis of the 
identified TF binding profiles. 

(1) The generation of the list of genes with significant differential expression. 

The gene expression data can be retrieved from various gene expression related databases. 
These databases are not limited to those generated by microarray techniques. They can also include 
gene expression data obtained by real-time quantitative PCR, Northern blot hybridization, and other 
gene expression related methods, including proteomics. Exemplary databases of gene expression data 
are listed in Table 1 above. In addition to these already available data sets, the differentially expressed 
gene list can also be generated by any project-oriented specific experiments, using any of the 
techniques discussed above, or otherwise known in the art. According to the invention, the data 
retrieved from such databases, or from any other source, are intensively analyzed, especially when the 
data involve a large number of genes or gene sets (e.g., by SAM analysis). A list of genes 
showing significant differential expression is generated, and assigned the respective gene identifiers, 
based on the international nomenclature committee and other genome databases, using self-generated 
scripts. As noted before, differential gene expression is considered to be "significant" when there is at 
least an about two-fold, preferably at least about four-fold, more preferably at least about six-fold, 
most preferably at least about ten-fold difference between the expression of a given gene in a test and 
a reference sample, such as in normal and diseased subjects, or in various stages of disease 
development in a diseased subject. 
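
The fold-difference criterion described above can be illustrated with a minimal Python sketch. The function name, the gene identifiers and the expression values are hypothetical and serve only as an illustration; the sketch assumes that expression levels are positive numbers on a linear scale and that a fold difference is counted in either direction (up- or down-regulation).

```python
def significant_genes(test, reference, min_fold=2.0):
    """Return gene identifiers whose expression differs at least min_fold
    between the test and reference samples, in either direction."""
    selected = []
    for gene, test_level in test.items():
        ref_level = reference.get(gene)
        # Skip genes that are missing from the reference or have
        # non-positive levels, for which a ratio is undefined.
        if ref_level is None or ref_level <= 0 or test_level <= 0:
            continue
        ratio = test_level / ref_level
        fold = max(ratio, 1.0 / ratio)  # symmetric for up- and down-regulation
        if fold >= min_fold:
            selected.append(gene)
    return sorted(selected)


# Hypothetical expression levels (arbitrary units).
test = {"GENE_A": 80.0, "GENE_B": 10.0, "GENE_C": 5.0}
reference = {"GENE_A": 10.0, "GENE_B": 9.0, "GENE_C": 25.0}
print(significant_genes(test, reference, min_fold=4.0))
```

Raising `min_fold` to 4, 6 or 10 corresponds to the increasingly preferred thresholds recited above.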

(2) The identification of cis-regulatory regions of differentially expressed genes. 

Based on the gene list generated in (1), the full-length sequences of these genes are retrieved 
from various full-length gene databases (such as NCBI-based RefSeq, the NIH-based MGC consortium, 
Japan DBTSS, and the like) (Pruitt et al., 2001, Strausberg et al., 1999, Strausberg RL et al., 2002, 
Yamashita et al., 2001). These full-length sequences are then compared with the most up-to-date human 
genome sequence databases (Lander et al., 2001, McPherson et al., 2001) (such as Human Genome 
Working Draft, build 31, Nov 2002) for mapping their chromosomal location using, for example, the 
BLAT software (Kent, 2002). Depending on the particular purpose, the cis-regulatory region, such as, 
for example, the 5' upstream core promoter region, the 5' upstream enhancer region, intron region, 
and/or 3' regulatory region, is defined and the corresponding genomic sequences are retrieved from 
the most up-to-date genome sequence databases (UCSC genome browser) (Kent et al., 2002, Karolchik 
et al., 2003). If necessary, the sequence-retrieving process can be facilitated by using self-developed 
scripts. 
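
As an illustration of how a regulatory region can be defined once a transcription start site (TSS) has been mapped, the following Python sketch computes the genomic interval of a core promoter window. It is only a sketch under stated assumptions: the function name is hypothetical, coordinates are taken to be 1-based and inclusive, and the default window of about 250 bp upstream and 50 bp downstream is borrowed from Example 1 rather than being a general requirement of the method.

```python
def core_promoter_window(tss, strand, upstream=250, downstream=50):
    """Return (start, end), 1-based inclusive, of the window around a
    transcription start site on the given strand."""
    if strand == "+":
        start, end = tss - upstream, tss + downstream
    elif strand == "-":
        # On the minus strand, "upstream" lies at higher coordinates.
        start, end = tss - downstream, tss + upstream
    else:
        raise ValueError("strand must be '+' or '-'")
    # Clamp at the start of the chromosome.
    return max(1, start), end


print(core_promoter_window(1000, "+"))
print(core_promoter_window(1000, "-"))
```

The resulting interval would then be used to fetch the corresponding sequence from a genome sequence database.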

(3) Mapping of regulatory factor binding profiles on the cis-regulatory regions identified. 

The genomic sequences for regulatory regions identified are screened for any putative 
regulatory factor binding sites, such as TF binding sites. For instance, the core promoter regions of 
the differentially expressed genes can be analyzed using known transcription factor binding sites. 
Software available for this kind of analysis is disclosed, for example, in the following publications: 
Grabe, 2002, Kel-Margoulis et al., 2000, Kel et al., 1995, Liebich et al., 2002, Perier et al., 2000, Praz 
et al., 2002, Prestridge, 1996, Quandt et al., 1995, Tsunoda et al., 1999, and Wingender, 1994. These 
genomic sequences of regulatory regions can be further screened for putative cis-regulatory binding 
sites using various motif-finding software. This can be instrumental in mapping unknown 
transcription factor binding sites and unknown regulatory factor consensus motifs. 
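
The screening of a regulatory sequence for known binding sites can be sketched in Python as a simple scan on both strands that tolerates a limited number of mismatches. This is a deliberately simplified stand-in for the published software cited above, which typically relies on weight matrices; the function names and the example promoter sequence are hypothetical, while the E2F-1 site TTTGGCGG is taken from this disclosure (SEQ ID NO: 2).

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def find_sites(sequence, consensus, max_mismatches=0):
    """Return (position, strand, matched_text) for every window that matches
    the consensus on either strand with at most max_mismatches differences.
    Positions are 0-based offsets on the forward strand."""
    sequence = sequence.upper()
    width = len(consensus)
    hits = []
    for strand, motif in (("+", consensus), ("-", reverse_complement(consensus))):
        for i in range(len(sequence) - width + 1):
            window = sequence[i:i + width]
            mismatches = sum(a != b for a, b in zip(window, motif))
            if mismatches <= max_mismatches:
                hits.append((i, strand, window))
    return hits


# Hypothetical promoter fragment carrying one site on each strand.
promoter = "GGCATTTGGCGGAACCGCCAAATT"
print(find_sites(promoter, "TTTGGCGG"))
```

Allowing `max_mismatches` greater than zero reflects the observation that some positions of a binding site tolerate base changes; a palindromic motif would be reported once per strand by this sketch.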

(4) Statistical analysis of the regulatory factor binding profiles. 

The putative regulatory factor binding sites identified in the differentially expressed genes are 
compared with their genome-wide or tissue-wide occurrence. The number of such binding sites, the 
frequencies of such binding profiles and the distribution and frequencies of occurrence are calculated, 
using statistical analysis. Statistical analysis can be performed, for example, using the hypergeometric 
distribution model, which describes the total number of successes in a fixed size sample drawn 
without replacement from a finite population. In particular, the hypergeometric distribution analysis 
(by using a Microsoft Excel built-in function in combination with a self-developed script) can be used to 
test if the appearances of certain regulatory factor (e.g. TF) binding sites are significantly enriched in 
the differential expression gene list. Such enrichment may result in abnormalities, such as tumor, e.g. 
cancer, when comparing with the genomic or tissue background. If necessary, the regulatory factor, 
e.g. TF, can be identified and its sequence provided, based upon such statistical analysis. Such 
regulatory factors, e.g. TFs, are valuable targets for therapeutic intervention directed to the prevention 
or treatment of diseases, disorders, or unwanted biological processes. 
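
The hypergeometric enrichment test can be illustrated with the following Python sketch, which uses only the standard library in place of the spreadsheet function mentioned above. The function name and the numbers in the example are hypothetical. The sketch assumes a one-sided test: given a background of `population` genes, of which `successes` carry the binding site, it computes the probability of observing at least `observed` site-carrying genes in a differentially expressed list of `sample` genes.

```python
from math import comb


def hypergeom_enrichment_p(population, successes, sample, observed):
    """One-sided hypergeometric P value, P(X >= observed), for drawing
    `sample` genes without replacement from `population` genes of which
    `successes` carry the binding site."""
    upper = min(successes, sample)
    total = comb(population, sample)
    tail = sum(
        comb(successes, k) * comb(population - successes, sample - k)
        for k in range(observed, upper + 1)
    )
    return tail / total


# Hypothetical example: 20 background genes, 8 with the site;
# 5 differentially expressed genes, 4 of which carry the site.
p_value = hypergeom_enrichment_p(population=20, successes=8, sample=5, observed=4)
print(f"{p_value:.4f}")
```

A small P value indicates that the binding site occurs in the gene list more often than expected from the background frequency alone.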

It will be apparent to those skilled in the art that other statistical methods can also be 
employed, as long as they are suitable for the comparison of frequencies or probabilities of the 
occurrences of regulatory regions in the genes identified in any two gene sets. 

In a particular embodiment, the cis-regulatory regions, e.g. regulatory factor binding sites, of 
differentially expressed genes are identified by the method disclosed in co-pending application Serial 
No. 10/402,689, filed on March 28, 2003. In brief, according to this approach, genomic sequences of 
gene regulatory regions are retrieved from public and/or proprietary databases, DNA sequence 
information for each retrieved gene regulatory region is screened to identify putative regulatory 
factor binding sites, the putative regulatory factor binding sites are profiled, and probability mapping 
is applied to the profiled binding sites. The probability mapping involves the identification of 
specific regulatory factor binding sites, such as all the putative E2F-1 transcription factor binding 
sites, in the regulatory regions of all genes in a gene set, e.g. a set of differentially expressed genes in 
a particular disease, disease state, abnormality, and the like. The probability mapping tells how many 
of the differentially expressed genes are likely to be transcription-regulated by a specific regulatory 
factor. It also indicates how much genome-wide, cell-wide, or tissue-wide effect a specific regulatory 
factor is expected to have. 
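
One simple reading of the probability mapping described above is the fraction of genes in a set whose regulatory region carries at least one putative site for a given factor. The Python sketch below illustrates that reading only; the co-pending application may define the mapping differently, and the function name, gene identifiers and site assignments are hypothetical.

```python
def site_frequency(gene_sites, factor):
    """Fraction of genes whose regulatory region carries at least one
    putative binding site for the given regulatory factor."""
    if not gene_sites:
        return 0.0
    carriers = sum(1 for sites in gene_sites.values() if factor in sites)
    return carriers / len(gene_sites)


# Hypothetical mapping of genes to the factors with putative sites
# in their regulatory regions.
gene_sites = {
    "GENE_A": {"E2F-1", "NF-kB"},
    "GENE_B": {"E2F-1"},
    "GENE_C": {"AP-1"},
    "GENE_D": set(),
}
print(site_frequency(gene_sites, "E2F-1"))
```

Computing the same fraction for the differentially expressed gene set and for the genome-wide or tissue-wide background yields the two frequencies that are compared in the statistical analysis.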

For each binding site identified, a conservation score can be created. The conservation score 
is selected to cover regions where the regulatory factor (e.g. TF) binding sites are identified as well as 
any other measurements that indicate conservation levels between the two species including but not 
limited to mouse and human. A binding site with a higher conservation score or the corresponding gene 
with a higher expression level could play a more significant role than those with lower scores. 

The data generated can be collected and organized in a data bank, which can facilitate the use 
of the information in research and drug development efforts. 

It is emphasized, however, that it is not necessary to use this proprietary approach to practice 
the present invention. Databases that include mapping information of gene regulatory regions can 
be developed in many different ways. Accordingly, the present invention is by no means limited by 
the way of mapping and analyzing the regulatory factor binding sites of differentially expressed genes. 

Examples of regulatory factor binding sites that can be identified in accordance with the 
present invention include, but are not limited to, the binding site for transcription factor NF-kB 
(AGGGGACTTTCCCA; SEQ ID NO: 1), and for E2F-1 (TTTGGCGG; SEQ ID NO: 2). 

If the initial information is a proteomic profile (e.g. a mass spectrum) showing differential 
protein expression levels, the corresponding genes are located and identified, and the list of genes and 
their corresponding protein expression levels are used in the subsequent analysis. 

C. Therapeutic Identification and Transcription Factor Decoy Design 

In one specific application, the statistical analysis of regulatory binding sites performed in 
accordance with the present invention provides a facile way for identifying targets for therapeutic drug 
design, and for developing various therapeutic approaches directed to the targets identified, including, 
but not limited to, the design of oligonucleotide decoys. 

It is well possible that all diseases, including human diseases, are somehow associated with 
the gene transcription process. It is well known that germline mutations in genes encoding 
transcription factors result in malformation syndromes affecting the development of multiple body 
structures. Somatic mutations in genes encoding transcription factors have been shown to contribute 
to tumorigenesis. In addition, prenatal development and postnatal physiology demonstrate that a single 
transcription factor can control the proliferation of progenitor cells during development, and the 
expression within the differentiated cells of gene products that participate in specific physiological 
responses. By way of example, well-studied transcription factors, such as p53, and the Smad and 
STAT proteins are known to play a major role in many cancers. Transcription factors have also been 
identified as being involved in various neuronal, cardiovascular, renal and infectious diseases, diseases 
of bone development, digestive diseases, diseases associated with abnormal skeletal development, and 
the like. For further details see, for example, Gregg L. Semenza, Transcription Factors and Human 
Disease, Oxford Press 1998. 

Although the transcription factor protein-DNA interaction is sequence-specific, the binding 
site for one given transcription factor may vary by several base pairs within different target genes. 
The common part, or non-variable part, of the binding sequence for a particular transcription factor is 
referred to as the transcription factor consensus sequence. For example, the consensus sequence for 
transcription factor NF-kB is AGGGGACTTTCCCA (SEQ ID NO: 1); and for E2F-1 is 
TTTGGCGG (SEQ ID NO: 2). The AP-1 transcription factor binds to the TGACTCA (SEQ ID NO: 
3) consensus sequence. The consensus sequence for the Smad-3 transcription factor, which mediates 
TGF-β, activin and BMP-induced changes in gene expression, is TGTCTGTCT (SEQ ID NO: 4). 
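
The notion of a consensus sequence as the common, non-variable part of a set of binding sites can be illustrated with a short Python sketch. It assumes that the observed sites have already been aligned and have equal length; the function name and the variant sites in the example are hypothetical, and positions without sufficient agreement are written as N.

```python
from collections import Counter


def consensus(sites, min_fraction=1.0):
    """Build a consensus from equal-length, aligned binding sites. A position
    keeps its most common base if that base occurs in at least min_fraction
    of the sites; otherwise it is written as 'N'."""
    if not sites:
        raise ValueError("at least one site is required")
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("all sites must have the same length")
    result = []
    for column in zip(*sites):
        base, count = Counter(column).most_common(1)[0]
        result.append(base if count / len(sites) >= min_fraction else "N")
    return "".join(result)


# Hypothetical variants of a site observed in three target genes.
observed = ["TTTGGCGG", "TTTCGCGG", "TTTGGCGC"]
print(consensus(observed))                    # strict: variable positions become N
print(consensus(observed, min_fraction=0.6))  # majority rule
```

The strict setting marks the positions that vary between target genes, while the majority rule returns the most frequent base at every position.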

If any of such consensus sequences are enriched in a biological sample representing a disease, 
disorder or pathologic condition, the corresponding transcription factor is a promising target of novel 
therapeutic approaches directed to such disease, disorder or condition. 

According to the transcription factor decoy approach, small double-stranded oligonucleotides 
are introduced into cells to specifically bind to target transcription factors, thereby preventing these 
factors from transactivating (i.e. "turning on") their target genes. 
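
The two strands of a double-stranded decoy oligonucleotide built around a consensus binding site can be derived as in the following Python sketch. The function name and the optional flanking sequences are hypothetical; an actual decoy design involves further considerations (length, flanking bases, chemical modification) that are outside the scope of this illustration. The E2F-1 consensus TTTGGCGG is taken from this disclosure (SEQ ID NO: 2).

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def design_decoy(consensus, flank5="", flank3=""):
    """Return (sense, antisense) strands, both written 5' to 3', of a
    double-stranded decoy containing the consensus binding site."""
    sense = (flank5 + consensus + flank3).upper()
    # The antisense strand is the reverse complement of the sense strand.
    antisense = sense.translate(_COMPLEMENT)[::-1]
    return sense, antisense


sense, antisense = design_decoy("TTTGGCGG")
print(sense, antisense)
```

Annealing the two strands gives a duplex that presents the binding site to the target transcription factor.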

In preclinical studies, pressure-mediated ex vivo delivery of E2F Decoy has been shown to prevent 
both neointimal hyperplasia and atherosclerosis in vein grafts of an animal model of vein graft 
transplantation. For more information, see, e.g. Ehsan, A., M.J. Mann 2001; Mann and Dzau 2000; 
Mann et al. 1999; and U.S. Patent Nos. 5,766,901 and 5,992,687. 

Further details of the invention are illustrated by the following non-limiting examples. 

Example 1 

The method of the invention was applied to a set of cell cycle related gene expression data 
(Whitfield et al., 2002). Proper regulation of the cell division cycle is crucial to the growth and 
development of all organisms; understanding this regulation is central to the study of many diseases, 
most notably cancer. 

The genome-wide program of gene expression during the cell division cycle in a human 
cancer cell line (HeLa) was characterized using cDNA microarrays. Transcripts of more than 850 
genes showed periodic variation during the cell cycle. Hierarchical clustering of the expression 
patterns revealed coexpressed groups of previously well-characterized genes involved in essential cell 
cycle processes such as DNA replication, chromosome segregation, and cell adhesion along with 
genes of uncharacterized function. Most of the genes whose expression had previously been reported 
to correlate with the proliferative state of tumors were found to be periodically expressed during the 
HeLa cell cycle. The data in this report provide a comprehensive catalog of cell cycle regulated genes 
that can serve as a starting point for the method of the present invention. The full dataset was 
retrieved from the http://genome-www.stanford.edu/Human-CellCycle/HeLa/ site for further analysis. 

In order to identify the key elements involved in the above differentially expressed genes in the cell 
cycle, the full-length sequences of these genes were retrieved, using the combination of the UCSC 
genome browser (Karolchik et al., 2003, Kent et al., 2002), the MGC gene collection database and 
the DBTSS database. The transcription start site positions were mapped to the newest human genome 
working draft (McPherson et al., 2001, Lander et al., 2001) using the BLAT program. The sequences 
for core promoter regions (which is about 250 bp upstream and 50 bp downstream to the transcription 

Table 3 

Name of TF      Freq. of TF binding     Freq. of genomic    P of Hypergeometric
                in target gene list     TF binding          Distribution
E2F-1           [illegible]             0.428784151         0.00000008*
Elk-1           [illegible]             0.469247702         0.0003617*
[illegible]     [illegible]             0.586430144         0.47923023
MAZ             [illegible]             0.525767189         0.06789041
TFII-I          0.494949495             0.536514308         0.89462549
HNF-4           0.47979798              0.468470802         0.40087184
c-Myc/Max       0.45959596              0.402563771         0.05840235
E2F             0.449494949             0.244853036         [illegible]
Xvent-1         0.444444444             0.417713324         0.24291237
E2F-1/DP-1      0.419191919             0.171112262         [illegible]
c-Ets-1(p54)    0.288888889             0.330182572         0.04665969
Sp3             0.383838384             0.369092322         0.35791823
TCF-1(P)        0.353535354             0.318205361         0.15923196
c-Rel           0.348484848             0.302214165         0.08983233

In conclusion, the key transcription factors E2F-1 and Elk-1 have been identified as factors 
that may play an essential role affecting the 850 genes with differential expression found during the 
specific cell cycle processes. The cell cycle has been shown to be crucial in the development of many 
different kinds of tumor or cancer. The immediate benefit from this is that one can develop therapeutic 
strategies based on these key elements. The transcription factor decoy (e.g., E2F-1 Decoy, 
Corgentech Inc.) or anti-sense oligonucleotides are examples of such novel treatment options. 
The role of E2F-1 and Elk-1 in cell proliferation was gradually established after numerous 
experiments and years of study. However, our invention makes this time-consuming process an easy 
and fast task. 

All references cited throughout the disclosure, and all references cited therein are hereby 
expressly incorporated by reference in their entirety. 

One skilled in the art will recognize many methods and materials similar or equivalent to 
those described herein, which could be used in the practice of the present invention. Indeed, the 
present invention is in no way limited to the methods and materials described. 

WHAT IS CLAIMED IS: 



1. A method for statistical analysis of differentially expressed genes, comprising: 

(a) obtaining a set of differentially expressed genes; 

(b) screening genomic sequences including the regulatory regions of said 
differentially expressed genes for the presence of regulatory factor binding sites; and 

(c) identifying at least one regulatory factor binding site enriched within said set 
of differentially expressed genes relative to a genome-wide or tissue-wide background. 

2. The method of claim 1 wherein in step (c) enrichment is determined by comparing the 
frequency or probability of the occurrence of the regulatory binding site or binding sites identified in 
step (c) within said gene set with the frequency or probability of their occurrence in a genome-wide or 
tissue-wide background. 

3. The method of claim 1 wherein prior to obtaining said set of differentially expressed 
genes, a proteomic profile of a set of differentially expressed proteins is obtained. 

4. The method of claim 1 wherein said set of differentially expressed genes is part of a 
gene expression profile characteristic of a disease, disorder, or biological process. 

5. The method of claim 4 wherein said disease is selected from the group consisting of 
tumor, oncological diseases, neurological diseases, cardiovascular diseases, renal diseases, infectious 
diseases, digestive diseases, metabolic diseases, inflammatory diseases, autoimmune diseases, 
dermatological diseases, and diseases associated with trauma or abnormal skeletal development. 

6. The method of claim 5 wherein said tumor is cancer. 

7. The method of claim 6 wherein said cancer is selected from the group consisting of 
breast cancer, colon cancer, lung cancer, prostate cancer, hepatocellular cancer, gastric cancer, 
pancreatic cancer, cervical cancer, ovarian cancer, liver cancer, bladder cancer, cancer of the urinary 
tract, thyroid cancer, renal cancer, carcinoma, melanoma, and brain cancer. 

8. The method of claim 4 wherein said disorder is a developmental disorder. 

9. The method of claim 4 wherein said biological process is associated with aging. 

10. The method of claim 1 wherein said set consists of genes that show at least about 
two-fold differential expression relative to control. 

11. The method of claim 1 wherein said set consists of genes that show at least about 
four-fold differential expression relative to control. 

12. The method of claim 1 wherein said set consists of genes that show at least about ten- 
fold differential expression relative to control. 

13. The method of claim 1 wherein said regulatory factor binding site is identified within 
a region selected from the group consisting of a 5' upstream core promoter region, a 5' upstream 
enhancer region, an intron region, and a 3' regulatory region. 

14. The method of claim 13 wherein said regulatory factor binding site is a transcription 
factor binding site. 

15. The method of claim 14 wherein said transcription factor is selected from the group 
consisting of c-Fos, c-Jun, AP-1, Elk, ATF, c-Ets-1, c-Rel, CRF, CTF, GATA-1, POU1F1, NF-kB, 
POU2F1, POU2F2, p53, Pax-3, Sp1, TCP, TAR, TFEB, TCF-1, TFIIF, E2F-1, E2F-2, E2F-3, E2F-4, 
HIF-1, HIF-1a, HOXA1, HOXA5, Sp3, Sp4, TCF-4, APC, and STAT5A. 

16. The method of claim 15 wherein said transcription factor is selected from the group 
consisting of E2F-1, E2F-2, E2F-3, NF-kB, Elk, AP-1, c-Fos, and c-Jun. 

17. The method of claim 1 wherein at least 50 differentially expressed genes are 
analyzed. 

18. The method of claim 1 wherein at least 100 differentially expressed genes are 
analyzed. 

19. The method of claim 1 wherein at least 500 differentially expressed genes are 
analyzed. 

20. The method of claim 1 further comprising the step of designing a treatment strategy 
based upon the identification of said enriched regulatory factor binding site. 

21. The method of claim 20 wherein said enriched regulatory factor binding site is a 
transcription factor binding site binding to at least one transcription factor. 

22. The method of claim 21 wherein a consensus binding site is identified based on said 
enriched transcription factor binding site. 

23. The method of claim 20 wherein said treatment strategy relies on the design of a 
double-stranded oligonucleotide decoy, which competes with said enriched binding site for binding to 
the corresponding transcription factor. 

24. The method of claim 20 wherein said treatment strategy relies on an anti-sense 
oligonucleotide designed to bind to said enriched binding site. 

25. A method of designing a consensus regulatory factor binding site, comprising 
identifying a regulatory factor binding site enriched within a set of differentially expressed genes, 
relative to a genome-wide or tissue-wide control, and designing a consensus regulatory factor binding 
site consisting essentially of nucleotides shared by the regulatory factor binding sites enriched within 
said set of differentially expressed genes. 

26. A method of analyzing the enrichment of a regulatory factor binding site in a 
biological sample comprising a set of differentially expressed genes, comprising comparing the 
frequency or probability of the occurrence of said regulatory binding site within said gene set with the 
frequency or probability of its occurrence in a reference sample. 

27. The method of claim 26 wherein the biological sample is a tissue sample. 

28. The method of claim 27 wherein the tissue comprises tumor cells. 

29. The method of claim 28 wherein the tissue comprises cancer cells. 

30. The method of claim 28 wherein the cancer is selected from the group consisting of 
breast cancer, colon cancer, lung cancer, prostate cancer, hepatocellular cancer, gastric cancer, 
pancreatic cancer, cervical cancer, ovarian cancer, liver cancer, bladder cancer, cancer of the urinary 
tract, thyroid cancer, renal cancer, carcinoma, melanoma, and brain cancer. 

31. The method of claim 28 wherein the reference sample is a normal tissue of the same 
tissue type. 

32. The method of claim 28 wherein the reference sample is the human genome. 

33. The method of claim 26 wherein the biological sample is a biological fluid. 

34. The method of claim 26 wherein the enrichment is determined by using 
hypergeometric distribution analysis. 